A comparative study on feature reduction approaches in Hindi and Bengali named entity recognition

نویسندگان

  • Sujan Kumar Saha
  • Pabitra Mitra
  • Sudeshna Sarkar
چکیده

Features used for named entity recognition (NER) are often high dimensional in nature. These cause overfitting when training data is not sufficient. Dimensionality reduction leads to performance enhancement in such situations. There are a number of approaches for dimensionality reduction based on feature selection and feature extraction. In this paper we perform a comprehensive and comparative study on different dimensionality reduction approaches applied to the NER task. To compare the performance of the various approaches we consider two Indian languages namely Hindi and Bengali. NER accuracies achieved in these languages are comparatively poor as yet, primarily due to scarcity of annotated corpus. For both the languages dimensionality reduction is found to improve performance of the classifiers. A Comparative study of the effectiveness of several dimensionality reduction techniques is presented in detail in this paper. 2011 Elsevier B.V. All rights reserved.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

CRF-based Named Entity Recognition @ICON 2013

This paper describes performance of CRF based systems for Named Entity Recognition (NER) in Indian language as a part of ICON 2013 shared task. In this task we have considered a set of language independent features for all the languages. Only for English a language specific feature, i.e. capitalization, has been added. Next the use of gazetteer is explored for Bengali, Hindi and English. The ga...

متن کامل

Linguistic Issues in Language Technology – LiLT

This paper describes the development of Named Entity Recognition (NER) systems for two leading Indian languages, namely Bengali and Hindi using the Conditional Random Field (CRF) framework. The system makes use of different types of contextual information along with a variety of features that are helpful in predicting the different named entity (NE) classes. This set of features includes langua...

متن کامل

Maximum Entropy Approach for Named Entity Recognition in Bengali and Hindi

This paper reports about the development of a Named Entity Recognition (NER) system in two leading Indian languages, namely Bengali and Hindi using the Maximum Entropy (ME) framework. We have used the annotated corpora, obtained from the IJCNLP-08 NER Shared Task on South and South East Asian Languages (NERSSEAL) and tagged with a fine-grained Named Entity (NE) tagset of twelve tags. An appropr...

متن کامل

A Two Stage Language Independent Named Entity Recognition for Indian Languages

This paper describes about the development of a two stage hybrid Named Entity Recognition (NER) system for Indian Languages particularly for Hindi, Oriya, Bengali and Telugu. We have used both statistical Maximum Entropy Model (MaxEnt) and Hidden Markov Model (HMM) in this system. We have used variety of features and contextual information for predicting the various Named Entity (NE) classes. T...

متن کامل

Named Entity Recognition using Support Vector Machine: A Language Independent Approach

Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes and is now-a-days considered to be fundamental for many Natural Language Processing (NLP) tasks such as information retrieval, machine translation, information extraction, question answering systems and others. This paper reports about the development of a NER system for Bengali a...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Knowl.-Based Syst.

دوره 27  شماره 

صفحات  -

تاریخ انتشار 2012